On this page

  • Preliminaries
  • Relative vs absolute paths
  • Reading and writing .pkl (pickle) files
    • .to_pickle()
    • .read_pickle()
  • Reading and writing .txt and .csv files
    • .to_csv()
    • .read_csv()
  • Reading and writing .xlsx files
    • .to_excel()
    • .read_excel()
  • Some important data-acquisition packages
    • The “requests” package
    • The “pandas_datareader” package
    • The “yfinance” package
    • The “wrds” package

L07: Pandas I/O

Preliminaries

If you have not done so already (you were asked to do this in lecture01):

  1. Open a Terminal, type the following command and hit enter:
    • pip install yfinance pandas-datareader
  2. Open a Terminal, type the following command and hit enter:
    • conda install -y openpyxl xlrd
import pandas as pd
import numpy as np
from pathlib import Path
import requests
import pandas_datareader as pdr
import yfinance as yf

Relative vs absolute paths

In this lecture, we will have to tell Python exactly where on our drive it should store some files we create. We can do this by specifying the full path to these locations, for example:

‘C:/Users/ionmi/Dropbox/TEACHING’

The above is an absolute path: it contains the full path to the TEACHING folder on my drive.

An alternative way to specify a path (which we will use very often in this course) is to specify that path relative to the current working directory. To do this, we use a combination of one or more dots (‘.’) and/or slashes (‘/’) which have the following meaning:

‘.’ means the current working directory (in our case, this is the directory where these lecture notes are stored on the drive).

‘..’ means the parent of the current working directory.

‘../..’ means the parent of the parent of the current working directory.

‘../../..’ means the parent of the parent of the parent of the current working directory.

etc.

To see this in practice, we can use the Path function in the pathlib package, which allows us to see the absolute path of a given relative path:

Path('.').resolve() #this will give you a different output than mine
PosixPath('/data/imb/Dropbox/0teaching/UA_2023_Spring_525/lectures/lecture08_pandas_io')
Path('..').resolve()
PosixPath('/data/imb/Dropbox/0teaching/UA_2023_Spring_525/lectures')
Path('../..').resolve()
PosixPath('/data/imb/Dropbox/0teaching/UA_2023_Spring_525')
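
As an aside, Path objects can be combined with the / operator to build new relative paths (the folder and file names below are just hypothetical examples):

Path('..') / 'data' / 'myfile.csv'
PosixPath('../data/myfile.csv')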

Reading and writing .pkl (pickle) files

Python has its own binary serialization format called “pickle”. These types of files have the extension “.pkl”. Saving and loading data from pickle files is significantly faster than from/to “.csv”, so we will be using it quite a bit throughout the course.

df = pd.DataFrame(data=np.random.rand(5,4), columns = list('ABCD')) #you'll get different numbers every time you re-run this
df
A B C D
0 0.377738 0.380756 0.822901 0.685419
1 0.134739 0.608980 0.171573 0.064065
2 0.208184 0.155670 0.083572 0.922087
3 0.796854 0.193076 0.262668 0.639699
4 0.927714 0.290597 0.763463 0.390035

.to_pickle()

To store this data in a .pkl file, we use the “.to_pickle” function, applied right after the name of the dataframe which contains the data we want to store.

Syntax:

DataFrame.to_pickle(path, compression='infer', protocol=5, storage_options=None)

Note that the first argument (path) is mandatory (it has no default value). This argument specifies the name of the .pkl file you want to create (mydata.pkl below) and the location (directory) where this file should be stored (. below), all in a single string, separated by /.

df.to_pickle('./mydata.pkl')

Note that we can also compress the file. With the default compression='infer', pandas infers the compression type from the file extension, so writing to a .zip path compresses the file automatically:

df.to_pickle('./mydata.zip')

.read_pickle()

To read data from an existing .pkl file, we use the “.read_pickle” function, specifying as an argument the path to the file we want to read (including its name):

Syntax:

pandas.read_pickle(filepath_or_buffer, compression='infer', storage_options=None)

For example, if we want to read the contents of the .pkl file we just created above, and store those contents into a new variable df2, we would use:

df2 = pd.read_pickle('./mydata.pkl')
df2
A B C D
0 0.377738 0.380756 0.822901 0.685419
1 0.134739 0.608980 0.171573 0.064065
2 0.208184 0.155670 0.083572 0.922087
3 0.796854 0.193076 0.262668 0.639699
4 0.927714 0.290597 0.763463 0.390035

And we can read compressed .pkl files too:

df2 = pd.read_pickle('./mydata.zip')
df2
A B C D
0 0.377738 0.380756 0.822901 0.685419
1 0.134739 0.608980 0.171573 0.064065
2 0.208184 0.155670 0.083572 0.922087
3 0.796854 0.193076 0.262668 0.639699
4 0.927714 0.290597 0.763463 0.390035

Note a very important difference in how we use the two functions above. The syntax for .to_pickle() starts with DataFrame.to_pickle, which tells us that the function must be applied to an existing DataFrame. On the other hand, the syntax for .read_pickle() starts with pandas.read_pickle, which we shortened to pd.read_pickle because we imported pandas as pd in the first code cell in this notebook (at the top).

This pattern is the same for all the read-write functions we discuss in this lecture: the write functions (.to_pickle(), .to_csv(), .to_excel()) are written after the name of the dataframe we want to write to a file, while the read functions (.read_pickle(), .read_csv(), .read_excel()) follow the name of the pandas package (which we renamed to pd above).

Reading and writing .txt and .csv files

The most common way to read and write dataframes from/to .csv and .txt files is with the Pandas functions “.to_csv()” (for writing) and “.read_csv()” (for reading).

.to_csv()

Here is an abbreviated version of the syntax for .to_csv(), excluding parameters that are not used as often:

DataFrame.to_csv(path_or_buf=None, sep=',', columns=None, header=True, index=True, index_label=None)

Note that the default separator is a comma (sep=',') which means we can omit that parameter when we write .csv files. The columns parameter allows you to specify which columns of the dataframe you want to write in the .csv file.

df.to_csv('./mydata.csv', columns = ['B','C'])

To write tab-delimited .txt files, we use sep='\t' and change the file extension to .txt:

df.to_csv('./mydata.txt', sep='\t')

To write space-delimited .txt files, we use sep=' ' (though I always recommend using tabs for .txt files):

df.to_csv('./mydata_space.txt', sep=' ')
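
If you do not want the dataframe’s index written to the file as a separate column, you can pass index=False (a quick illustration; the file name below is arbitrary):

df.to_csv('./mydata_noindex.csv', index=False)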

.read_csv()

Here is an abbreviated version of the syntax for .read_csv(), excluding parameters that are not used as often:

pandas.read_csv(filepath_or_buffer, sep=',', header='infer', names=None, index_col=None, usecols=None, nrows=None, skiprows=None)

Note that the default separator is a comma (sep=',') which means we can omit that parameter when we read .csv files:

df3 = pd.read_csv('./mydata.csv')
df3
Unnamed: 0 B C
0 0 0.380756 0.822901
1 1 0.608980 0.171573
2 2 0.155670 0.083572
3 3 0.193076 0.262668
4 4 0.290597 0.763463

Note, however, that we did not specify that the first column is just an index for the table, so that column was included as data in the table itself. Note also that .read_csv() guessed that the column names are on the first row, because the default value of the header parameter is 'infer'. To be safe, I always recommend being explicit about where the row names and column names are (remember, Python starts counting from 0):

df3 = pd.read_csv('./mydata.csv', header = 0, index_col = 0)
df3
B C
0 0.380756 0.822901
1 0.608980 0.171573
2 0.155670 0.083572
3 0.193076 0.262668
4 0.290597 0.763463

To read tab-delimited .txt files, we use sep='\t':

df4 = pd.read_csv('./mydata.txt', sep='\t', header = 0, index_col = 0)
df4
A B C D
0 0.377738 0.380756 0.822901 0.685419
1 0.134739 0.608980 0.171573 0.064065
2 0.208184 0.155670 0.083572 0.922087
3 0.796854 0.193076 0.262668 0.639699
4 0.927714 0.290597 0.763463 0.390035

To read space-delimited .txt files, we use sep=' ':

df5 = pd.read_csv('./mydata_space.txt', sep=' ', header = 0, index_col = 0)
df5
A B C D
0 0.377738 0.380756 0.822901 0.685419
1 0.134739 0.608980 0.171573 0.064065
2 0.208184 0.155670 0.083572 0.922087
3 0.796854 0.193076 0.262668 0.639699
4 0.927714 0.290597 0.763463 0.390035
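
The usecols and nrows parameters from the syntax above let us read only part of a file. As a quick illustration, this reads only the first three columns (by position) and the first three data rows of the tab-delimited file we created earlier:

pd.read_csv('./mydata.txt', sep='\t', header = 0, index_col = 0, usecols = [0,1,2], nrows = 3)
A B
0 0.377738 0.380756
1 0.134739 0.608980
2 0.208184 0.155670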

The .to_csv() and .read_csv() functions have a lot more useful parameters. In the practice problems for this lecture, you will be asked to investigate some of them on your own by reading the official documentation:

  • .to_csv(): https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html
  • .read_csv(): https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html

Reading and writing .xlsx files

We’ll use .to_excel() to write Excel files and .read_excel() to read Excel files. The biggest difference from the .csv functions is that, with the Excel ones, you can specify a particular sheet in the Excel file that you want to read/write.

.to_excel()

Here is an abbreviated version of the syntax for .to_excel():

DataFrame.to_excel(excel_writer, sheet_name='Sheet1', columns=None)

df.to_excel('./mydata_excel.xlsx', sheet_name = 's1', columns = ['A', 'C'])
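
If you need to write multiple dataframes to different sheets of the same Excel file, you can use pandas’ ExcelWriter (a minimal sketch; the file and sheet names below are arbitrary):

with pd.ExcelWriter('./mydata_twosheets.xlsx') as writer:
    df.to_excel(writer, sheet_name = 's1')
    df.to_excel(writer, sheet_name = 's2')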

.read_excel()

Here is an abbreviated version of the syntax for .read_excel():

pandas.read_excel(io, sheet_name=0, header=0, names=None, index_col=None, usecols=None, skiprows=None, nrows=None)

df6 = pd.read_excel('./mydata_excel.xlsx', sheet_name = 's1', header = 0, index_col=0, engine = 'openpyxl')
df6
A C
0 0.377738 0.822901
1 0.134739 0.171573
2 0.208184 0.083572
3 0.796854 0.262668
4 0.927714 0.763463

The .to_excel() and .read_excel() functions have a lot more useful parameters. In the practice problems for this lecture, you will be asked to investigate some of them on your own by reading the official documentation:

  • .to_excel(): https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_excel.html
  • .read_excel(): https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html

Some important data-acquisition packages

The “requests” package

The requests package allows us to retrieve data from websites. If you want a more detailed discussion of the full functionality of this package, see the documentation at https://docs.python-requests.org/en/latest/.

Here, we’ll just see how we can use the package to download data from files hosted on websites. For this, we need the URL of the file we want to download. In the example below, I use data on economic policy uncertainty in the US from this website:

https://www.policyuncertainty.com/us_monthly.html

If you right-click on the “Download Data” link and select “Copy Link Address”, you should see the link below when you paste it in your code:

url = "https://www.policyuncertainty.com/media/US_Policy_Uncertainty_Data.xlsx"

We use the .get() function to retrieve the (binary) data from the URL above:

r = requests.get(url)

We can check if the request was successful using the status_code attribute: 200 means the request was successful, while 404 means the file was not found.

r.status_code
200
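
If you would rather have Python raise an error automatically for an unsuccessful request, the requests package also provides the raise_for_status() method:

r.raise_for_status() #does nothing here, since the request succeeded; raises an exception for 4xx/5xx responses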

To write the data we retrieved into an Excel file on our computer, we use the Python built-in open() function, specifying the path to the file we want to write the data to ('./policy_uncertainty.xlsx' below) and specifying that we are writing binary data to it ('wb' below):

outfile = open('./policy_uncertainty.xlsx', 'wb')
type(outfile)
_io.BufferedWriter

Now we can write the data into that file using its write() method and then close the file with its close() method. Note that the actual data from the URL above is found under the content attribute of the request r that we created above.

outfile.write(r.content)
outfile.close()
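
An equivalent, slightly safer idiom uses a with statement, which closes the file automatically even if an error occurs while writing:

with open('./policy_uncertainty.xlsx', 'wb') as outfile:
    outfile.write(r.content)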

We can check if this process worked either by manually opening the policy_uncertainty.xlsx file in our working directory, or by using pd.read_excel() to read the data into a dataframe and take a look at it:

df = pd.read_excel('./policy_uncertainty.xlsx').dropna() #.dropna() means drop rows with missing values
df
Year Month Three_Component_Index News_Based_Policy_Uncert_Index
0 1985 1.0 125.224740 103.748803
1 1985 2.0 99.020809 78.313193
2 1985 3.0 112.190509 100.761475
3 1985 4.0 102.811319 84.778863
4 1985 5.0 120.082716 98.053653
... ... ... ... ...
452 2022 9.0 174.183475 201.738489
453 2022 10.0 177.423127 207.275335
454 2022 11.0 171.737257 210.374894
455 2022 12.0 136.502767 150.156098
456 2023 1.0 143.894044 162.788431

457 rows × 4 columns

The “pandas_datareader” package

The pandas_datareader package allows us to download data from many different sources on the internet. Here is a list of all these sources:

https://pandas-datareader.readthedocs.io/en/latest/readers/index.html

The general syntax to download data from a particular source is as follows:

Syntax:

pandas_datareader.DataReader(name,data_source=None,start=None,end=None)

The two sources I will cover here are the St. Louis Federal Reserve Economic Data (FRED) (data_source = 'fred') which contains a lot of useful macroeconomic data, and the Fama-French Data (Ken French’s Data Library) (data_source = 'famafrench') which contains returns on many portfolios commonly used in asset pricing (e.g. the market portfolio, SMB, HML, etc).

For both of these sources, we use the name parameter to specify what exactly we want to download from these data sources.

For example, to download data on the CPI from FRED, we need to use name = 'CPIAUCSL', which is the internal name that FRED uses for the CPI data:

https://fred.stlouisfed.org/series/CPIAUCSL

cpi = pdr.DataReader(name = 'CPIAUCSL', data_source = 'fred', start = '2020-09-01', end = '2020-12-31')
cpi
CPIAUCSL
DATE
2020-09-01 260.190
2020-10-01 260.352
2020-11-01 260.721
2020-12-01 261.564
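
As a quick illustration of working with this data (not part of the download syntax), monthly inflation can be computed as the percent change in the CPI:

cpi['CPIAUCSL'].pct_change() #the first value will be missing (there is no prior month in the sample)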

To download data on the Fama-French three risk factors (market, SMB, and HML), we use name = 'F-F_Research_Data_Factors', which is the name of the text file containing these factors on Ken French’s website:

https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html

ff3f = pdr.DataReader(name ='F-F_Research_Data_Factors', data_source='famafrench', start='2020-09-01', end='2020-12-31')
ff3f
{0:          Mkt-RF   SMB   HML    RF
 Date                             
 2020-09   -3.63  0.04 -2.68  0.01
 2020-10   -2.10  4.36  4.21  0.01
 2020-11   12.47  5.82  2.14  0.01
 2020-12    4.63  4.89 -1.51  0.01,
 1:       Mkt-RF    SMB    HML    RF
 Date                            
 2020   23.66  13.17 -46.57  0.45,
 'DESCR': 'F-F Research Data Factors\n-------------------------\n\nThis file was created by CMPT_ME_BEME_RETS using the 202212 CRSP database. The 1-month TBill return is from Ibbotson and Associates, Inc. Copyright 2022 Kenneth R. French\n\n  0 : (4 rows x 4 cols)\n  1 : Annual Factors: January-December (1 rows x 4 cols)'}

Note that for the ‘famafrench’ data source, the DataReader function returns a dictionary of dataframes, not a single pandas dataframe. That’s because the ‘F-F_Research_Data_Factors’ file contains multiple tables. The monthly returns on the Fama-French risk factors are in the first entry in that dictionary (the 0 key), so we can retrieve them like this:

ff3f[0]
Mkt-RF SMB HML RF
Date
2020-09 -3.63 0.04 -2.68 0.01
2020-10 -2.10 4.36 4.21 0.01
2020-11 12.47 5.82 2.14 0.01
2020-12 4.63 4.89 -1.51 0.01
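
The 'DESCR' key holds a short description of every table in the file, which can help you figure out which key contains the data you need:

print(ff3f['DESCR'])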

There is no easy way to know under what name you can find the data you need: you have to look at the FRED and Fama-French websites first to see what names they use, and then type those names into your code, as we did above.

The “yfinance” package

The yfinance package allows us to retrieve stock price data from Yahoo Finance. The full documentation for the package can be found here: https://pypi.org/project/yfinance/ (especially, look under “Fetching data for multiple tickers” on the main page).

Syntax:

yfinance.download(tickers, start = None, end = None, interval = '1d')

For example, to retrieve monthly stock prices for Microsoft and Apple, we need to supply their tickers in a single string (separated by a space) as the first parameter to the download function and change the interval parameter to '1mo' (otherwise it will give us daily data):

prc = yf.download(tickers = "MSFT AAPL", start = '2020-09-01', end = '2020-12-31', interval = '1mo')
prc
[*********************100%***********************]  2 of 2 completed
Adj Close Close High Low Open Volume
AAPL MSFT AAPL MSFT AAPL MSFT AAPL MSFT AAPL MSFT AAPL MSFT
Date
2020-09-01 114.239166 206.105713 115.809998 210.330002 137.979996 232.860001 103.099998 196.250000 132.759995 225.509995 3.885245e+09 768176300.0
2020-10-01 107.383453 198.403580 108.860001 202.470001 125.389999 225.210007 107.720001 199.619995 117.639999 213.490005 2.894666e+09 631618000.0
2020-11-01 117.435226 209.770584 119.050003 214.070007 121.989998 228.119995 107.320000 200.119995 109.110001 204.289993 2.123077e+09 573443000.0
2020-11-06 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2020-11-18 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2020-12-01 131.116043 218.523499 132.690002 222.419998 138.789993 227.179993 120.010002 209.110001 121.010002 214.509995 2.322190e+09 594761700.0

We will always drop missing values (with .dropna()) and use Adj Close prices (prices adjusted for dividends and splits):

aprc = prc['Adj Close'].dropna()
aprc
AAPL MSFT
Date
2020-09-01 114.239166 206.105713
2020-10-01 107.383453 198.403580
2020-11-01 117.435226 209.770584
2020-12-01 131.116043 218.523499
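
As an aside (we will do this often later in the course), monthly returns can be computed from these adjusted prices with the .pct_change() method:

ret = aprc.pct_change().dropna() #simple monthly returns for AAPL and MSFT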

Note that, if we download data for a single stock and extract a single column (as below), we end up with a pandas Series, not a DataFrame:

prc2 = yf.download(tickers = "MSFT", start = '2020-09-01', end = '2020-12-31', interval = '1mo')['Adj Close'].dropna()
prc2
[*********************100%***********************]  1 of 1 completed
Date
2020-09-01    206.105698
2020-10-01    198.403580
2020-11-01    209.770599
2020-12-01    218.523514
Name: Adj Close, dtype: float64
type(prc2)
pandas.core.series.Series

As mentioned before, we will usually turn a Series into a dataframe before working with it further:

aprc2 = prc2.to_frame()
aprc2
Adj Close
Date
2020-09-01 206.105698
2020-10-01 198.403580
2020-11-01 209.770599
2020-12-01 218.523514

The “wrds” package

The wrds package allows us to download data directly from the WRDS database. Unfortunately, this functionality is not available for class accounts like the ones I created for this course. I only mention this package for PhD students, who should be able to use this package with their own individual WRDS credentials. The documentation for this package is found here:

https://pypi.org/project/wrds/
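
For PhD students with their own credentials, here is a minimal sketch of how the package is typically used (the query below is purely illustrative; the tables available depend on your WRDS subscription):

import wrds
db = wrds.Connection() #prompts for your WRDS username and password
msf = db.raw_sql('select permno, date, ret from crsp.msf limit 10') #illustrative query against CRSP monthly stock file
db.close()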